Project Detail

College Major vs Your Salary — Data Exploration with Pandas

The standard advice is "pick STEM." This analysis stress-tests that claim across 51 undergraduate majors using a WSJ/PayScale survey of 1.2 million Americans. Engineered salary risk scores, growth percentages, earnings ceilings, and rank-change metrics reveal a more nuanced picture: STEM leads on starting salary, but HASS majors grow at nearly the same rate. Math more than doubles in salary by mid-career. Economics has the highest 90th-percentile earnings of any major, above every engineering field. And the safest long-term bet by earnings predictability isn't engineering at all: it's Nursing. The analysis surfaces 8 concrete findings through a full EDA pipeline, correlation analysis, and 7 charts, all committed to the repo and viewable without running any code.

Data data-analysis python feature-engineering statistical-analysis visualisation

Quick Facts

Tech: Python, pandas, NumPy, Matplotlib, Seaborn, Jupyter Notebook, CSV

Overview

Problem

Salary data for choosing a major exists, but it sits in flat files that cannot quickly answer the questions that actually matter: which majors pay off fastest, which are the safest long-term bets, and how do broad degree categories really compare? A raw CSV does not surface those patterns on its own. Without programmatic sorting, filtering, and aggregation, you end up eyeballing rows and missing the bigger picture. The real work is turning a flat table into meaningful salary rankings, risk scores, and group comparisons.

Solution

Loaded the dataset into a pandas DataFrame and ran a structured EDA pipeline: shape inspection, null detection (.tail() and .isna() caught a footer row made up of NaN values), cleaning with .dropna(), then index-based lookups via .idxmax() / .idxmin() to surface salary extremes by major name. Engineered eight derived columns: Spread (P90 − P10, a proxy for earnings risk), Growth % ((Mid − Start) / Start × 100), Safety score (Start / Spread), Start rank, Mid rank, Rank_Change, Salary_Band (pd.cut), and Group_Avg_Start (groupby transform). Ranked majors across four lenses, computed a pairwise Pearson correlation matrix with upper-triangle masking, and generated seven publication-quality charts (including bar, grouped bar, scatter + regression, heatmap, boxplot, and rank-change bar). Results are exported to CSV and JSON. A summary DataFrame built from the computed variables collects all key findings in one table, so the figures are recomputed on every run rather than hard-coded.
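The cleaning and feature-engineering steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the toy rows, the salary-band bin edges, and the sign convention for Rank_Change (positive means the major climbed) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values; the last row mimics a non-data footer.
# Column names are assumed, not taken from the real CSV.
raw = pd.DataFrame(
    {
        "Undergraduate Major": ["Major A", "Major B", "Major C", "Major D", "Source: footer"],
        "Starting Median Salary": [40000.0, 50000.0, 60000.0, 45000.0, np.nan],
        "Mid-Career Median Salary": [80000.0, 75000.0, 90000.0, 95000.0, np.nan],
        "Mid-Career 10th Percentile Salary": [40000.0, 50000.0, 60000.0, 45000.0, np.nan],
        "Mid-Career 90th Percentile Salary": [140000.0, 100000.0, 160000.0, 195000.0, np.nan],
        "Group": ["HASS", "Business", "STEM", "HASS", np.nan],
    }
)

# Null detection: count missing values per column, then drop incomplete rows.
missing_per_column = raw.isna().sum()
df = raw.dropna().set_index("Undergraduate Major")

start = df["Starting Median Salary"]
mid = df["Mid-Career Median Salary"]

# Derived columns described in the text.
df["Spread"] = df["Mid-Career 90th Percentile Salary"] - df["Mid-Career 10th Percentile Salary"]
df["Growth_Pct"] = (mid - start) / start * 100
df["Safety"] = start / df["Spread"]
df["Start_Rank"] = start.rank(ascending=False, method="min").astype(int)
df["Mid_Rank"] = mid.rank(ascending=False, method="min").astype(int)
df["Rank_Change"] = df["Start_Rank"] - df["Mid_Rank"]  # assumed convention: positive = climbed
df["Salary_Band"] = pd.cut(
    start,
    bins=[0, 45000, 55000, np.inf],  # illustrative bin edges only
    labels=["Low", "Mid", "High"],
)
df["Group_Avg_Start"] = df.groupby("Group")["Starting Median Salary"].transform("mean")

# Index-based lookups return the major name because the index holds the names.
top_mid = mid.idxmax()
lowest_start = start.idxmin()
print(df[["Spread", "Growth_Pct", "Safety", "Rank_Change", "Salary_Band"]])
```

Setting the major name as the index is what lets .idxmax() and .idxmin() return a readable label instead of a row number.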

Challenges

The most significant challenge was analytical rather than technical: data integrity. Early findings confidently claimed that HASS majors beat the average STEM starting salary and that Philosophy was the biggest rank climber. Both were wrong. The first claim rested on assumed category labels: Economics is categorised as Business in this dataset, not HASS, and so is Nursing. Catching these errors required auditing every static claim against the actual computed output and rebuilding the findings from scratch. The corrected version, with Journalism as the biggest climber and STEM edging HASS on growth (70.40% vs 68.86%), is a better story precisely because it came from verifying the data rather than assuming the obvious interpretation.
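One way to make that kind of audit repeatable is to pair each written claim with a function that recomputes it from the data. The sketch below is hypothetical: the helper name audit_claims, the claim list, the placeholder majors, and every salary figure are invented for illustration; only the Business labels for Economics and Nursing come from the text above.

```python
import pandas as pd

# Toy table. The two Business labels come from the text; the other rows and
# all salary figures are invented placeholders.
df = pd.DataFrame(
    {
        "Group": ["Business", "Business", "HASS", "STEM"],
        "Starting Median Salary": [50000, 54000, 40000, 60000],
    },
    index=pd.Index(["Economics", "Nursing", "Major X", "Major Y"], name="Undergraduate Major"),
)

# Each static claim is paired with the value it asserts and a function that
# recomputes that value from the data.
claims = [
    ("Economics is a HASS major", "HASS", lambda d: d.loc["Economics", "Group"]),
    ("Nursing is a Business major", "Business", lambda d: d.loc["Nursing", "Group"]),
    (
        "Highest average starting salary is in STEM",
        "STEM",
        lambda d: d.groupby("Group")["Starting Median Salary"].mean().idxmax(),
    ),
]


def audit_claims(data, claim_list):
    """Recompute every claim and report whether it matches the data."""
    rows = []
    for text, expected, compute in claim_list:
        actual = compute(data)
        rows.append(
            {"claim": text, "expected": expected, "actual": actual, "ok": actual == expected}
        )
    return pd.DataFrame(rows)


report = audit_claims(df, claims)
print(report[~report["ok"]])  # only the claims the data contradicts
```

Running a check like this on every execution would flag a claim such as "Economics is a HASS major" as soon as it disagrees with the Group column.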

Results / Metrics

60+ pandas operations demonstrated across inspection, cleaning, feature engineering, multi-condition filtering, aggregation, groupby, reshaping, apply/map, correlation analysis, styling, and export. Seven charts are committed to plots/ and viewable without running code. Key findings: Chemical Engineering has the highest mid-career median at $107,000; Math is the biggest salary grower at 103.52%; Nursing has the lowest earnings risk with a $50,700 spread; Economics has the highest P90 ceiling at $210,000; starting salary correlates strongly with mid-career salary (Pearson r = 0.848); STEM out-earns HASS by about $27,844 at mid-career; HASS and STEM have nearly identical growth rates (68.86% vs 70.40%).
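The correlation figure comes from a pairwise Pearson matrix. A minimal sketch of that step, including the upper-triangle mask that hides the duplicated half of the symmetric matrix, might look like this. The column names and numbers are invented placeholders, so the coefficients it prints are not the project's results.

```python
import numpy as np
import pandas as pd

# Invented numbers purely for illustration; column names are assumed.
df = pd.DataFrame(
    {
        "Starting Median Salary": [40000, 50000, 60000, 45000],
        "Mid-Career Median Salary": [80000, 75000, 90000, 95000],
        "Spread": [100000, 50000, 100000, 150000],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# Boolean mask that is True on and above the diagonal. The matrix is
# symmetric, so those cells repeat information or are trivially 1.0.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Blank out the masked cells, leaving one value per column pair.
lower = corr.mask(mask)

# Flatten to a sorted list of unique pairs for quick reading.
pairs = lower.stack().dropna().sort_values(ascending=False)
print(pairs)
```

The same boolean array can be passed as the mask argument of seaborn.heatmap so that the plot shows each pair only once.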

Screenshots

No screenshots available yet.

Videos

No videos available yet.